Issue/253: feat: support offline int8 kv cache quantization#254

Open
qinyiqun wants to merge 4 commits into main from Issue/253

Conversation


@qinyiqun qinyiqun commented Mar 4, 2026

Support offline int8 kv cache quantization for static kv cache
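Offline int8 kv cache quantization generally means keys and values are stored as int8 with scales calibrated ahead of time, and dequantized on read. A minimal NumPy sketch of the idea (illustrative only, not this PR's actual kernels; the function names are hypothetical):

```python
import numpy as np

def quantize_kv_int8(x: np.ndarray, scale: float) -> np.ndarray:
    # symmetric per-tensor int8 quantization with an offline-calibrated scale
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

def dequantize_kv_int8(q: np.ndarray, scale: float) -> np.ndarray:
    # recover an approximation of the original values at read time
    return q.astype(np.float32) * scale

# "offline" calibration: the scale comes from a range observed on a
# calibration set; here we just use the tensor's own max for illustration
x = np.array([0.5, -1.0, 2.0], dtype=np.float32)
scale = float(np.abs(x).max()) / 127.0
q = quantize_kv_int8(x, scale)
x_hat = dequantize_kv_int8(q, scale)
# the round trip is lossy, but the error stays within one quantization step
assert np.all(np.abs(x - x_hat) <= scale)
```

The cache then holds `q` plus the scalar `scale` instead of the fp16/fp32 tensor, roughly halving (vs fp16) or quartering (vs fp32) kv cache memory.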

@qinyiqun qinyiqun requested review from a team and wooway777 March 4, 2026 07:20
@qinyiqun qinyiqun changed the title from "Issue/253: feat: support custom KV cache dtype for quantization" to "Issue/253: feat: support offline int8 kv cache quantization" Mar 18, 2026
-def __init__(self, max_batch_size: int = 1, max_cache_len: int = 0):
-    _infinilm.StaticKVCacheConfig.__init__(self, max_batch_size, max_cache_len)
+def __init__(self, max_batch_size: int = 1, max_cache_len: int = 0, kv_cache_dtype: str | None = None):
@PanZezhong1725 PanZezhong1725 (Collaborator) Mar 19, 2026

It would be better to additionally provide an interface that accepts the framework's internal dtype, and to provide a parse mapping in Python, so that Python users can open the Python file and see what each string means.

@qinyiqun (Contributor, Author)

[image attached] In vLLM, strings are used everywhere up to the flash attention call, so I don't think using strings is a problem, as long as they are documented.

…quant.cpp; (2)update kv_cache_dtype handling; (3)Update Python test scripts
@PanZezhong1725 PanZezhong1725 force-pushed the Issue/253 branch 2 times, most recently from 203620f to ae78252 Compare March 20, 2026 08:42